 stochastic game


On the convergence of policy gradient methods to Nash equilibria in general stochastic games

Neural Information Processing Systems

Multi-agent learning in stochastic N-player games is a notoriously difficult problem because, in addition to their changing strategic decisions, the players of the game must also contend with the fact that the game itself evolves over time, possibly in a very complicated manner. Because of this, the equilibrium convergence properties of popular learning algorithms, like policy gradient and its variants, are poorly understood, except in specific classes of games (such as potential or two-player, zero-sum games). In view of all this, we examine the long-run behavior of policy gradient methods with respect to Nash equilibrium policies that are second-order stationary (SOS) in a sense similar to the type of KKT sufficiency conditions used in optimization. Our analysis shows that SOS policies are locally attracting with high probability, and we show that policy gradient trajectories with gradient estimates provided by the REINFORCE algorithm achieve an O(1/√n) convergence rate to such equilibria if the method's step-size is chosen appropriately.
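To make the role of the REINFORCE gradient estimate concrete, here is a minimal sketch in Python (NumPy). It is a stateless simplification and not the paper's algorithm: one player with a softmax policy faces a fixed opponent in a matrix game. The payoff matrix, the opponent policy, and the function names `reinforce_estimates` and `exact_gradient` are all assumptions made for illustration. The sketch checks numerically that averaging score-function estimates recovers the exact policy gradient.

```python
import numpy as np

def softmax(theta):
    # Numerically stable softmax over a vector of logits.
    z = np.exp(theta - theta.max())
    return z / z.sum()

def reinforce_estimates(theta, payoff, opp_policy, n_samples, rng):
    """Return one REINFORCE gradient estimate per sampled play.

    Each estimate is reward * grad log pi(a), where for a softmax
    policy grad log pi(a) = e_a - pi.
    """
    pi = softmax(theta)
    own = rng.choice(len(pi), size=n_samples, p=pi)
    opp = rng.choice(len(opp_policy), size=n_samples, p=opp_policy)
    rewards = payoff[own, opp]
    score = np.eye(len(pi))[own] - pi
    return rewards[:, None] * score

def exact_gradient(theta, payoff, opp_policy):
    # Exact gradient of the expected payoff with respect to the logits.
    pi = softmax(theta)
    q = payoff @ opp_policy          # expected payoff of each own action
    return pi * (q - pi @ q)

# Hypothetical game data, chosen only for illustration.
rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1, 0.0])
payoff = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
opp_policy = np.array([0.7, 0.3])

samples = reinforce_estimates(theta, payoff, opp_policy, 200_000, rng)
estimate = samples.mean(axis=0)
exact = exact_gradient(theta, payoff, opp_policy)
```

In a learning loop, a single such estimate would replace the exact gradient at each step, which is why the step-size schedule matters for the convergence rate.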



On the Global Convergence Rates of Decentralized Softmax Gradient Play in Markov Potential Games

Neural Information Processing Systems

Softmax policy gradient is a popular algorithm for policy optimization in single-agent reinforcement learning, particularly since projection is not needed for each gradient update. However, in multi-agent systems, the lack of central coordination introduces significant additional difficulties in the convergence analysis. Even for a stochastic game with identical interest, there can be multiple Nash Equilibria (NEs), which rules out proof techniques that rely on the existence of a unique global optimum. Moreover, the softmax parameterization introduces non-NE policies with zero gradient, making it difficult for gradient-based algorithms to find NEs. In this paper, we study the finite-time convergence of decentralized softmax gradient play in a special form of game, Markov Potential Games (MPGs), which includes the identical interest game as a special case. We investigate both gradient play and natural gradient play, with and without log-barrier regularization. The established convergence rates for the unregularized cases contain a trajectory-dependent constant that can be arbitrarily large, whereas the log-barrier regularization overcomes this drawback, at the cost of slightly worse dependence on other factors such as the action set size. An empirical study on an identical interest matrix game confirms the theoretical findings.
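The following is a minimal sketch, in Python (NumPy), of decentralized softmax gradient play on a two-player identical-interest matrix game. It is not the paper's experimental setup: the payoff matrix, step size, iteration count, and the function name `softmax_gradient_play` are assumptions chosen for illustration, and the sketch uses exact gradients and no log-barrier regularization. The example game has two pure NEs, with payoffs 2 and 1, so which one is reached depends on the initialization.

```python
import numpy as np

def softmax(theta):
    # Numerically stable softmax over a vector of logits.
    z = np.exp(theta - theta.max())
    return z / z.sum()

def softmax_gradient_play(payoff, theta_a=None, theta_b=None,
                          steps=5000, lr=0.5):
    """Run unregularized softmax gradient play with exact gradients.

    Both players receive the same payoff, payoff[a, b], and each
    updates only its own logits (no central coordination).
    """
    n_a, n_b = payoff.shape
    theta_a = np.zeros(n_a) if theta_a is None else np.array(theta_a, float)
    theta_b = np.zeros(n_b) if theta_b is None else np.array(theta_b, float)
    for _ in range(steps):
        pi_a, pi_b = softmax(theta_a), softmax(theta_b)
        q_a = payoff @ pi_b        # value of each action of player A
        q_b = payoff.T @ pi_a      # value of each action of player B
        value = pi_a @ q_a         # shared expected payoff
        # Softmax policy gradient: d value / d theta_i = pi_i * (q_i - value)
        theta_a = theta_a + lr * pi_a * (q_a - value)
        theta_b = theta_b + lr * pi_b * (q_b - value)
    return softmax(theta_a), softmax(theta_b)

# Hypothetical identical-interest game with pure NEs at (0, 0) and (1, 1).
payoff = np.array([[2.0, 0.0], [0.0, 1.0]])

# Uniform start: both players drift toward the payoff-2 equilibrium.
pi_a, pi_b = softmax_gradient_play(payoff)
value = float(pi_a @ payoff @ pi_b)

# Start biased toward action 1: play settles at the payoff-1 equilibrium.
pi_a2, pi_b2 = softmax_gradient_play(payoff, theta_a=[0.0, 3.0],
                                     theta_b=[0.0, 3.0])
value2 = float(pi_a2 @ payoff @ pi_b2)
```

The second run illustrates the point about multiple NEs: gradient play converges to an equilibrium, but not necessarily to the one with the highest shared payoff.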


Unified

Neural Information Processing Systems

Policy optimization, i.e. algorithms that learn to make sequential decisions by local search on the agent's policy directly, is a widely used class of algorithms in reinforcement learning [40, 44, 45].